Milestone 1

Context: The used car market is outpacing the new car market in sales. How do we determine how to price old cars for maximum profit?

Objective: Our goal is to design a pricing model that can effectively predict the price of used cars and help the business in devising profitable strategies using differential pricing.

Key Questions: What key features determine car value? How can we use regression to determine maximum profitability?

Data Dictionary

S.No. : Serial Number

Name : Name of the car which includes Brand name and Model name

Location : The location in which the car is being sold or is available for purchase (Cities)

Year : Manufacturing year of the car

Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.

Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)

Transmission : The type of transmission used by the car. (Automatic / Manual)

Owner : Type of ownership

Mileage : The standard mileage offered by the car company in kmpl or km/kg

Engine : The displacement volume of the engine in CC.

Power : The maximum power of the engine in bhp.

Seats : The number of seats in the car.

New_Price : The price of a new car of the same model in INR 100,000

Price : The price of the used car in INR 100,000 (Target Variable)

Library Import

Load Data

Exploratory Data Analysis

Summary Statistics Overview

Year: 1996 - 2019
Kil_Driv: High std may be due to outliers. Max value is an outlier
Mileage: average is ~18
Engine: may have some outliers that skew distritbution
Power: has a high max value compared to 75th quartile 
Seat: most cars have 5. min 2 - max 10.
New_price: most values missing; major imputation strategy needed
Price: min price is .44 and max is 160. 50% quartile is 5.64

Unique values Observations

S.No. has 7253 values. We can drop this column as it will not add to analysis.

Dropping 'S.No.' Column

We are missing data for:

Mileage
Engine
Power
Seats
New_price
Price

Missing values imputation

There are many values missing across the abovementioned categories. To begin the imputation process. We will create new columns for Brand and Model by splitting Name column. This data engineering feature will allow us to better impute values via groupby.

Imputing missing Mileage data

Imputing missing Engine data

Imputing missing Power data

Imputing missing Seats data

Imputing missing New_price data

Imputing missing New_price data

Review new summary statistics for major changes

None found.

Categorical Data Analysis

Top 5 Car Brands

Maruti           19.909003
Hyundai          18.475114
Honda            10.244037
Toyota            6.990211
Mercedes-Benz     5.239211
Volkswagen        5.156487

Top 5 Car Models

XUV500 W8 2WD                     0.758307
Swift VDI                         0.675583
Swift Dzire VDI                   0.579071
City 1.5 S MT                     0.537709
Swift VDI BSIV                    0.510134

Top 5 Car Names

Mahindra XUV500 W8 2WD                  0.758307
Maruti Swift VDI                        0.675583
Maruti Swift Dzire VDI                  0.579071
Honda City 1.5 S MT                     0.537709
Maruti Swift VDI BSIV                   0.510134

Top 5 Car Locations

Mumbai        13.084241
Hyderabad     12.077761
Coimbatore    10.643872
Kochi         10.643872
Pune          10.547360

Univariate Analysis - Numerical Data

Numerical Data for Analysis

Year                   int64 - considered an 'object'
Kilometers_Driven      int64
Mileage              float64
Engine               float64
Power                float64
Seats                float64
New_price            float64
Price                float64

Kilometers_Driven Histogram

Kilometers_Driven has an extreme outlier we will drop. This may be an input error.

To create a normal distribution for our Linear Regression, we will transform Kilometers_Driven by Log

kilometers_driven_log now has a normal distribution for use in our LR model.

Mileage Histogram

Mileage has a normal distribution. We also see two top mode values.

Engine Histogram

Power Histogram

To create a normal distribution for our Linear Regression, we will transform Power by Log

New_price Histogram

Price Histogram

Univariate Analysis - Categorical Data

Categorical Values

Name                  object
Location              object
Fuel_Type             object
Transmission          object
Owner_Type            object

Observations

There are 2 top brands in the Indian market: Maruti and Hyundai
Location: Mumbai, Hyderabad and Coimbatore have the highest sales
Most cars are powered by Diesel or Petrol
Most cars are manual
Most buyers are First Time

Bivariate Analysis - Scatter Plot

Cars with higher mileage trend towards lower prices.

cars with low Kilometers_Driven are more expensive.

Cars with greater power see an increase in price.

Most cars are made with 5 seats. There appears to be a price point for all models.

Bivariate Analysis - Heat Map

Heat Map Observations

Power, Engine and Price has strong correlations. We've seen from our pairplot that Engine and Power have a positive correlation. These 2 features have an effect on the Price and New_Price of a used car at ~ 0.66 - .077.
Engine and Mileage have a negative relationship.The higher the mileage on the car may affect the Engine (and Power).

Bivariate Analysis - Box Plot

For our Box Plot, Price is the Y value for all plots. The X variables will be all of our categorical variables. We will also include an additional plot that removes the outliers.

Proposed approach

Potential Techniques

We will create a Linear Regression model to see if we can plot the price point predictions.
We will complete Ridge and Lasso Regression tests
We will complete a Decision Tree model
We will complete a Random Forest model

Overall Solution Design

Using the abovementioned techniques, we will prepare a train and test data set to see if the model correctly predicts the price.

Measures of Success

To test the success of our models we will review our r-square results. R-squared measures the strength of the relationship between our model and the dependent variable on a convenient 0 – 100% scale. The model with the best R-square percent, is wht we will go with to create our algorithm for future predictions.

Milestone 1

Context: The used car market is outpacing the new car market in sales. How do we determine how to price old cars for maximum profit?

Objective: Our goal is to design a pricing model that can effectively predict the price of used cars and help the business in devising profitable strategies using differential pricing.

Key Questions: What key features determine car value? How can we use regression to determine maximum profitability?

Data Dictionary

S.No. : Serial Number

Name : Name of the car which includes Brand name and Model name

Location : The location in which the car is being sold or is available for purchase (Cities)

Year : Manufacturing year of the car

Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.

Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)

Transmission : The type of transmission used by the car. (Automatic / Manual)

Owner : Type of ownership

Mileage : The standard mileage offered by the car company in kmpl or km/kg

Engine : The displacement volume of the engine in CC.

Power : The maximum power of the engine in bhp.

Seats : The number of seats in the car.

New_Price : The price of a new car of the same model in INR 100,000

Price : The price of the used car in INR 100,000 (Target Variable)

Library Import

Load Data

Exploratory Data Analysis

Summary Statistics Overview

Year: 1996 - 2019
Kil_Driv: High std may be due to outliers. Max value is an outlier
Mileage: average is ~18
Engine: may have some outliers that skew distritbution
Power: has a high max value compared to 75th quartile 
Seat: most cars have 5. min 2 - max 10.
New_price: most values missing; major imputation strategy needed
Price: min price is .44 and max is 160. 50% quartile is 5.64

Unique values Observations

S.No. has 7253 values. We can drop this column as it will not add to analysis.

Dropping 'S.No.' Column

We are missing data for:

Mileage
Engine
Power
Seats
New_price
Price

Missing values imputation

There are many values missing across the abovementioned categories. To begin the imputation process. We will create new columns for Brand and Model by splitting Name column. This data engineering feature will allow us to better impute values via groupby.

Imputing missing Mileage data

Imputing missing Engine data

Imputing missing Power data

Imputing missing Seats data

Imputing missing New_price data

Imputing missing New_price data

Review new summary statistics for major changes

None found.

Categorical Data Analysis

Top 5 Car Brands

Maruti           19.909003
Hyundai          18.475114
Honda            10.244037
Toyota            6.990211
Mercedes-Benz     5.239211
Volkswagen        5.156487

Top 5 Car Models

XUV500 W8 2WD                     0.758307
Swift VDI                         0.675583
Swift Dzire VDI                   0.579071
City 1.5 S MT                     0.537709
Swift VDI BSIV                    0.510134

Top 5 Car Names

Mahindra XUV500 W8 2WD                  0.758307
Maruti Swift VDI                        0.675583
Maruti Swift Dzire VDI                  0.579071
Honda City 1.5 S MT                     0.537709
Maruti Swift VDI BSIV                   0.510134

Top 5 Car Locations

Mumbai        13.084241
Hyderabad     12.077761
Coimbatore    10.643872
Kochi         10.643872
Pune          10.547360

Univariate Analysis - Numerical Data

Numerical Data for Analysis

Year                   int64 - considered an 'object'
Kilometers_Driven      int64
Mileage              float64
Engine               float64
Power                float64
Seats                float64
New_price            float64
Price                float64

Kilometers_Driven Histogram

Kilometers_Driven has an extreme outlier we will drop. This may be an input error.

To create a normal distribution for our Linear Regression, we will transform Kilometers_Driven by Log

kilometers_driven_log now has a normal distribution for use in our LR model.

Mileage Histogram

Mileage has a normal distribution. We also see two top mode values.

Engine Histogram

Power Histogram

To create a normal distribution for our Linear Regression, we will transform Power by Log

New_price Histogram

Price Histogram

Univariate Analysis - Categorical Data

Categorical Values

Name                  object
Location              object
Fuel_Type             object
Transmission          object
Owner_Type            object

Observations

There are 2 top brands in the Indian market: Maruti and Hyundai
Location: Mumbai, Hyderabad and Coimbatore have the highest sales
Most cars are powered by Diesel or Petrol
Most cars are manual
Most buyers are First Time

Bivariate Analysis - Scatter Plot

Cars with higher mileage trend towards lower prices.

cars with low Kilometers_Driven are more expensive.

Cars with greater power see an increase in price.

Most cars are made with 5 seats. There appears to be a price point for all models.

Bivariate Analysis - Heat Map

Heat Map Observations

Power, Engine and Price has strong correlations. We've seen from our pairplot that Engine and Power have a positive correlation. These 2 features have an effect on the Price and New_Price of a used car at ~ 0.66 - .077.
Engine and Mileage have a negative relationship.The higher the mileage on the car may affect the Engine (and Power).

Bivariate Analysis - Box Plot

For our Box Plot, Price is the Y value for all plots. The X variables will be all of our categorical variables. We will also include an additional plot that removes the outliers.

Proposed approach

Potential Techniques

We will create a Linear Regression model to see if we can plot the price point predictions.
We will complete Ridge and Lasso Regression tests
We will complete a Decision Tree model
We will complete a Random Forest model

Overall Solution Design

Using the abovementioned techniques, we will prepare a train and test data set to see if the model correctly predicts the price.

Measures of Success

To test the success of our models we will review our r-square results. R-squared measures the strength of the relationship between our model and the dependent variable on a convenient 0 – 100% scale. The model with the best R-square percent, is wht we will go with to create our algorithm for future predictions.

Milestone 2

Model Building

  1. What we want to predict is the "Price". We will use the normalized version 'price_log' for modeling.
  2. Before we proceed to the model, we'll have to encode categorical features. We will drop categorical features like - Name
  3. We'll split the data into train and test, to be able to evaluate the model that we build on the train data.
  4. Build Regression models using train data.
  5. Evaluate the model performance.

Split Data

  • Step1: Split the data into X and Y .

  • Step2: Encode the categorical variables in X using pd.dummies.

  • Step3: Split the data into train and test using train_test_split.

  • Fitting a linear model

    Linear Regression can be implemented using:

    1) Sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
    2) Statsmodels: https://www.statsmodels.org/stable/regression.html

    Linear Regression R-Square Observations

    Our LR model has a poor performance.
    The trained set has -0.21
    The test set has 0.83

    Important variables of Linear Regression

    Building a model using statsmodels

    Ridge Regression

    https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

    Ridge Regression R-Square Observations

    The R square values are still poor for the Ridge Regression model. We may have completed a log to log transformation which is yielding a negative training result. The R square value has improved from our LR model, however, with a negative fit on the training data, we cannot trust the test data to be accurate.

    Lasso Regression

    Lasso Regression R-Square Observations

    The R square values are still poor for the Lasso Regression model. We may have completed a log to log transformation which is yielding a negative training result. The R square value has de-proved from our LR and RR model.

    Decision Tree

    https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

    Decision Tree R-Square Observations

    The R square values from our Decision Tree so great improvement from our previous models. The training set is currently overfitted at 0.99. Our test set has seen a step improvement and is reporting at 0.78. This may be the preferred model.
    
    The RMSE on the training set is overfitted. The RMSE on the test set has a large number. This may be due to an incorrect log fit in the engineering phase or in the set up of our X and Y values.

    Print the importance of features in the tree building ( The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

    Decision Tree Important Features Observations:

    The Data Engineering of New_price may be causing the poor R square values we are seeing in our model. We will need to review our imputations to be sure. I do not believe New_price should be an important feature in this analysis as most of the values were imputed by Brand, Model and Year.

    Random Forest

    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

    Random Forest R-Square Observations

    The R square values from our Random Forest show great improvement from our Decision Tree model. The training set is overfitted at 0.97. Our test set has seen a step improvement and is reporting at 0.84. This may be the preferred model over Decision Tree.
    
    Let us review feature importance and compare to the Decision Tree.

    Random Forest Important Features Observations:

    I do not believe New_price should be an important feature in this analysis as most of the values were imputed by Brand, Model and Year. Engine, Power, Kilometers_driven_log all showed a true correlation to price in our Bivariate Analysis.

    Hyperparameter Tuning: Decision Tree

    Decision Tree Tuned R-Square Observations

    The dtree_tuned model shows dramatic improvement from its predecessor dtree. The model is still overfitted, but it is steadily learning.

    Hyperparameter Tuning: Random Forest

    Random Forest Tuned R-Square Observations

    The R square values from our Random Forest tuned are the same from our non-tuned Random Forest model.
    
    Let us review feature importance and compare to the original Random Forest features of importance.

    Random Forest Tuned Important Features Observations:

    The features of importance across all models are New_price, power_log, and Year in varying percentages.

    All Models Comparison

    Out of the 7 predictive models, we found the Random Forest model performed best at ~89%.